Classification in Imbalanced Datasets

نویسندگان

  • Thomas Debray
  • Evgueni N. Smirnov
  • Georgi Nalbantov
  • Evgueni Smirnov
چکیده

In this thesis we study the classification task in the presence of class imbalanced data. This task arises in many applications when we are interested in the under-represented (minority) classes. Examples of such applications are related to fraud detection, medical diagnosis and monitoring, text categorization, risk management, information retrieval and filtering. Although there exist many standard approaches to the classification task, most of them have poor generalisation performance on the minority class. This thesis studies well-known approaches to the classification problem in the presence of class imbalanced data, such as Cost-Sensitivity, Bagging for Imbalanced Datasets, MetaCost and SMOTE. The main contribution of the thesis is a new approach to the problem that we call Naive Bayes Sampling. The approach is a generative approach. It generates new instances of the minority class by bootstrapping values of each feature present in the training data. Experiments show the superiority of our approach on 4 UCI datasets and a medical dataset provided by KULeuven.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Proposing a Novel Cost Sensitive Imbalanced Classification Method based on Hybrid of New Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms

In this paper, a new hybrid methodology is introduced to design a cost-sensitive fuzzy rule-based classification system. A novel cost metric is proposed based on the combination of three different concepts: Entropy, Gini index and DKM criterion. In order to calculate the effective cost of patterns, a hybrid of fuzzy c-means clustering and particle swarm optimization algorithm is utilized. This ...

متن کامل

ارائه‌روش جدید مبتنی‌بر برنامه‌نویسی ژنتیک برای وزن‌دهی قوانین فازی در طبقه‌بندی نامتوازن

In classification problems, we often encounter datasets with different percentage of patterns (i.e. classes with a high pattern percentage and classes with a low pattern percentage). These problems are called “classification Problems with imbalanced data-sets”. Fuzzy rule based classification systems are the most popular fuzzy modeling systems used in pattern classification problems. Rule weights...

متن کامل

First study of the behaviour of genetic fuzzy classifier based on low quality data respect to the preprocessing of low quality imbalanced datasets

There are real-world dataset where we can found classes with a very different percentage of patterns between them, that is to say we have classes represented by many examples (high percentage of patterns) and classes represented by few examples (low percentage of patterns). These kind of datasets receive the name of “imbalanced datasets”. In the field of classification problems the imbalanced d...

متن کامل

Empirical Similarity for Absent Data Generation in Imbalanced Classification

When the training data in a two-class classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similarity-based Imbalanced Classification (SBIC) that learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the traini...

متن کامل

Enhancing Learning from Imbalanced Classes via Data Preprocessing: A Data-Driven Application in Metabolomics Data Mining

This paper presents a data mining application in metabolomics. It aims at building an enhanced machine learning classifier that can be used for diagnosing cachexia syndrome and identifying its involved biomarkers. To achieve this goal, a data-driven analysis is carried out using a public dataset consisting of 1H-NMR metabolite profile. This dataset suffers from the problem of imbalanced classes...

متن کامل

Using Self-organizing Maps for Binary Classification with Highly Imbalanced Datasets

Highly imbalanced datasets occur in domains like fraud detection, fraud prediction, and clinical diagnosis of rare diseases, among others. These datasets are characterized by the existence of a prevalent class (e.g. legitimate sellers) while the other is relatively rare (e.g. fraudsters). Although small in proportion, the observations belonging to the minority class can be of a crucial importan...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009